Method of synthesis for a steady sound signal

ABSTRACT

The present invention relates to a method of synthesizing a first sound signal based on a second sound signal, the first sound signal having a required first fundamental frequency and the second sound signal having a second fundamental frequency, the method comprising the steps of: a) determining required pitch bell locations in the time domain of the first sound signal, the pitch bell locations being distanced by one period of the first fundamental frequency, b) providing pitch bells by windowing the second sound signal on pitch bell locations in the time domain of the second sound signal, the pitch bell locations being distanced by one period of the second fundamental frequency, c) randomly selecting a pitch bell from the provided pitch bells for each of the required pitch bell locations, d) performing an overlap and add operation on the selected pitch bells for synthesizing the first signal.

The present invention relates to the field of synthesizing speech or music, and more particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demisyllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones. The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation based synthesis, the conservation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.

Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfil the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, a prosodic module performs this function. To allow the duration and pitch modifications in the recorded subunits, many concatenation based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) (E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 453-467, 1990) model of synthesis. When the signal to be synthesized is required to have an extended duration this is accomplished by repeating the pitch bells, which have been obtained from the original signal. This repetition process is illustrated in FIG. 1. Time axis 100 belongs to the time domain of the original signal. The original signal has a length of T spanning the time interval between zero and T on the time axis 100. Further, the original signal has a fundamental frequency f, which corresponds to a period p; pitch bells are obtained from the original signal by windowing the original signal by means of windows 102. In the example considered here the windows are spaced apart by the period p in the domain of time axis 100. This way the pitch bell locations i are determined on time axis 100. Time axis 104 belongs to the time domain of the signal to be synthesized. The signal to be synthesized is required to have a duration of yT, where y can be any number. Next a number of pitch bell locations j is determined on the time axis 104. Like on the time axis 100, the pitch bell locations j are spaced apart by the period p corresponding to the fundamental frequency f of the original signal. In order to increase the duration of the original signal each of the original pitch bells obtained from the original signal is repeated a number of y times.

This results in a number of intervals 106, 108, . . . in the domain of time axis 104, whereby each of the intervals 106, 108, . . . is composed of repetitions of identical pitch bells. For example the interval 106 contains repetitions of the pitch bell obtained from the pitch bell location i=1 from the original signal at pitch bell locations j (i=1, k=1) to j (i=1, k=y). This means that interval 106 contains a number of y repetitions of the pitch bell obtained from pitch bell location i=1 on time axis 100 of the original signal. Likewise the following interval 108 contains a number of y repetitions of the pitch bell obtained from pitch bell location i=2 from the original signal. As a consequence the synthesized signal is composed of concatenated sequences of pitch bell repetitions.
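The prior-art repetition scheme of FIG. 1 can be illustrated with a minimal sketch that lists, for each output location j, the index i of the source pitch bell placed there. The function name `repeated_bell_indices` is hypothetical, and y is assumed to be a whole number for simplicity, although the description allows any number.

```python
def repeated_bell_indices(num_bells: int, y: int) -> list[int]:
    """Source bell index i (1-based) for every output location j.

    Each source bell is repeated y times in a row, which produces the
    runs of identical bells (intervals 106, 108, ...) described above.
    Assumption: y is a positive integer.
    """
    return [i for i in range(1, num_bells + 1) for _ in range(y)]


# Three source bells stretched by y = 4: three runs of four identical bells.
indices = repeated_bell_indices(3, 4)
```

Each run of equal indices corresponds to one of the intervals 106, 108, . . . of FIG. 1.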

A common disadvantage of such PSOLA methods is that an extreme duration manipulation introduces audible transitions between the sequences into the signal. In particular this is a problem when the original sound is a hybrid sound like voiced fricatives having both a noisy and a periodic component. The repetition of pitch bells introduces periodicity in the noisy components, which makes the synthesized signal sound unnatural.

The present invention therefore aims to provide an improved method of synthesizing a sound signal, in particular for extreme duration modifications, like for singing.

The present invention provides for a method of synthesizing a sound signal based on an original signal in order to manipulate the duration of the original signal. In particular, the present invention enables extreme duration and pitch modifications of the original signal without audible artefacts. This is especially useful for synthesizing singing, where extreme duration manipulations in the order of 4 to 100 times the duration of the original signal can occur.

In essence, the present invention is based on the observation that prior art PSOLA methods introduce artefacts into a synthesized signal after duration manipulation because the transition from one chain of repeating pitch bells to the next is audible. This effect, which is experienced when a prior art PSOLA type method is employed for extreme duration manipulations, is particularly detrimental for hybrid sounds containing both a noisy and a periodic component.

In accordance with the invention, pitch bells are randomly selected from the original signal for each of the required pitch bell locations of the signal to be synthesized. This way the introduction of periodicity in the noisy components can be avoided and the naturalness of the original sound is preserved. In accordance with a preferred embodiment of the invention the original sound is a voiced fricative having both a noisy and a periodic component. Application of the present invention to such voiced fricatives is especially beneficial.

In accordance with a further preferred embodiment of the invention a raised cosine is used for windowing of voiced fricatives. For unvoiced sound intervals a sine window is used, which has the advantage that the total signal envelope in the power domain remains about constant. Unlike with a periodic signal, when two noise samples are added, the magnitude of the sum can be smaller than the absolute value of either of the two samples. This is because the signals are (mostly) not in phase; the sine window adjusts for this effect and removes the envelope modulation.

In accordance with a further preferred embodiment of the invention the original sound signal has periods which are spectrally alike and which have basically the same information content. Such periods, which are voiced, are classified by a first classifier and such periods which are unvoiced are classified by means of a second classifier.

In accordance with a further preferred embodiment of the invention the classification information of the original signal is stored in a computer system, such as a text-to-speech system. Intervals of the original signal which are classified as voiced or unvoiced steady periods being spectrally alike are processed in accordance with the present invention, whereby a raised cosine window is used for voiced intervals and a sine window is used for unvoiced intervals.

In the following, preferred embodiments of the invention are described in greater detail by making reference to the drawings, in which:

FIG. 1 is illustrative of a prior art PSOLA-type method,

FIG. 2 is illustrative of an example for synthesizing a sound signal in accordance with an embodiment of the present invention,

FIG. 3 is illustrative of a flow chart of an embodiment of a method of the present invention,

FIG. 4 shows an example of an original signal and of the synthesized signal, and

FIG. 5 is a block diagram of a preferred embodiment of a computer system.

FIG. 2 shows an example of synthesizing a signal based on an original signal. Time axis 200 is illustrative of the time domain of the original signal. The original signal has a duration T and spans the time between zero and T on time axis 200. The original signal has a fundamental frequency f which corresponds to a period p. The period p determines locations i on time axis 200 for windowing of the original signal by means of window 202. In the example considered here, the original signal is a voiced hybrid sound, such that a raised cosine window in accordance with the following formula is used:

$w[n] = 0.5 - 0.5 \cdot \cos\left( \frac{2\pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$

In the above relation, m is the length of the window and n is the running index.
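As a hedged illustration, the raised cosine window above can be evaluated directly from the formula. The function name `raised_cosine_window` and the choice m = 8 are assumptions made only for this sketch. It also checks a property that holds under the additional assumption that successive windows overlap by half their length: the two overlapping window values then add up to one.

```python
import math


def raised_cosine_window(m: int) -> list[float]:
    """w[n] = 0.5 - 0.5 * cos(2*pi*(n + 0.5) / m) for 0 <= n < m."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / m) for n in range(m)]


w = raised_cosine_window(8)
half = len(w) // 2

# Assumption: neighbouring windows are shifted by half the window length.
# The two overlapping window values then sum to 1 at every sample.
overlap_sum = [w[n] + w[n + half] for n in range(half)]
```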

When the original signal is an unvoiced sound signal, it is preferred to use the following window:

$w[n] = \sin\left( \frac{\pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$
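The statement that the sine window keeps the envelope about constant in the power domain can be checked numerically. The sketch below is illustrative only: the name `sine_window`, the length m = 8, and the assumption that neighbouring windows are shifted by half the window length are not taken from the description. Under that assumption the squared window values of two overlapping windows add up to one, whereas their plain sum exceeds one.

```python
import math


def sine_window(m: int) -> list[float]:
    """w[n] = sin(pi * (n + 0.5) / m) for 0 <= n < m."""
    return [math.sin(math.pi * (n + 0.5) / m) for n in range(m)]


w = sine_window(8)
half = len(w) // 2

# Shifting by half a window turns sin into cos, so the squares sum to 1
# (constant power), while the amplitudes sum to more than 1.
power_sum = [w[n] ** 2 + w[n + half] ** 2 for n in range(half)]
amplitude_sum = [w[n] + w[n + half] for n in range(half)]
```

For uncorrelated noise the powers of overlapping segments add, which is why a power-complementary window suits unvoiced intervals.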

The time domain of the signal to be synthesized is illustrated by time axis 204. The signal to be synthesized is required to have a duration of yT, where y can be any number, for example y=4 or y=6 or y=20 or y=50 or y=100.

The period p also determines the pitch bell locations j on time axis 204. Like on time axis 200, the pitch bell locations are spaced apart by period p. For each of the required pitch bell locations j, a random selection of a location of a pitch bell i in the time domain of the time axis 200 is made. In the example considered here there are 6 pitch bells, which are obtained by windowing of the original signal in the time domain of time axis 200. To select one of these obtained pitch bells for a pitch bell location j, a random number between 1 and 6 is generated. This way a random selection from the available pitch bells on pitch bell locations i=1 to i=6 is made. This process is repeated for all required pitch bell locations j on time axis 204. For example, a pitch bell for the required pitch bell location j=1 is selected by generating a random number between 1 and 6. In the example considered here, the number 6 is obtained, such that the pitch bell obtained from pitch bell location i=6 on the time axis 200 is selected for the required pitch bell location j=1 on the time axis 204. Likewise a random number is generated for the required pitch bell location j=2. The random number is 4 in this example, such that the pitch bell at pitch bell location i=4 on time axis 200 is selected for the required pitch bell location j=2. This process is performed for all required pitch bell locations j=1 to j=z on time axis 204. Due to the random selection of the pitch bells from the domain of the original signal, intervals 106, 108, . . . are avoided (cf. FIG. 1). As a consequence no such artefact is introduced into the synthesized signal and the synthesized signal sounds natural even for extreme duration manipulations.
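The random selection described for FIG. 2 can be sketched as follows. The function name `select_bell_indices`, the seed, and the choice of z = 24 output locations are hypothetical; the standard-library generator is used as one possible source of uniformly distributed random numbers, not necessarily the one used in the embodiment.

```python
import random


def select_bell_indices(num_bells: int, num_locations: int,
                        rng: random.Random) -> list[int]:
    """Draw one source bell index i in 1..num_bells for every location j.

    randint is inclusive on both ends, so each of the num_bells bells is
    equally likely at every output location.
    """
    return [rng.randint(1, num_bells) for _ in range(num_locations)]


# Six source bells (i = 1..6) and, as an assumption, z = 24 output locations.
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
selection = select_bell_indices(6, 24, rng)
```

Because each location is drawn independently, long runs of the same bell such as the intervals of FIG. 1 become unlikely rather than systematic.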

FIG. 3 shows a flow chart, which is illustrative of this method. In step 300 a recording of an original sound is provided. In step 302 hybrid sound intervals are identified and classified as voiced or unvoiced in the original sound recording. This can be done manually by a human expert or by means of a computer program, which analyses the original signal and/or its frequency spectrum for steady periods. Preferably the first analysis is performed by means of a program and a human expert reviews the output of the program. In step 304 pitch bells are obtained from the original sound signal by means of windowing. Windowing is performed by means of windows which are positioned synchronously with the fundamental frequency of the original sound signal, i.e. the windows are distanced by the period p of the original sound signal in the domain of the original sound signal. In step 306 the pitch bell locations j for which pitch bells are required in order to synthesize the signal are determined. Again the required pitch bell locations j are distanced by the period p. Alternatively the pitch bell locations j can be distanced by another period q corresponding to a higher or lower required fundamental frequency of the signal to be synthesized. This way the duration and the frequency can be modified. In step 308 a random selection of pitch bells is made for each of the required pitch bell locations j within the sound interval which is classified as hybrid. For other sound intervals a prior art PSOLA-type method may or may not be employed. In step 310 the pitch bells are overlapped and added on the pitch bell locations j in the domain of the signal to be synthesized.

FIG. 4 shows an example of an original sound signal 400 which is a diphone of /z/ to /z/ transition. Also the frequency spectrum 402 of the sound signal 400 is shown in FIG. 4.

Sound signal 404 is obtained from sound signal 400 in accordance with the present invention by randomly selecting pitch bells obtained from the sound signal 400 for the required pitch bell locations in the time domain of the synthesized sound signal 404. In the example considered here the synthesized sound signal 404 is y=5 times longer than the original sound signal 400. Also the frequency spectrum 406 of the sound signal 404 is shown in FIG. 4. As apparent from the sound signal 404 and its frequency spectrum 406, the characteristics of the original sound signal 400 are preserved in the synthesized signal and no artefacts are introduced. As a consequence the sound signal 404 sounds identical to the sound signal 400 but is 5 times longer.

FIG. 5 shows a block diagram of a computer system, such as a text-to-speech synthesis system. The computer system 500 comprises a module 502 for storing of an original sound signal. Module 504 serves to enter and store sound classification information for the original sound signal stored in module 502. For example, steady voiced periods are marked with an ‘r’ and steady unvoiced periods are marked with an ‘s’ in the original sound signal. Module 506 serves for windowing of the original sound signal of module 502 in order to obtain pitch bells. Depending on the sound classification, a raised cosine or a sine window is used for steady voiced periods or steady unvoiced periods, respectively. Module 508 serves to determine the required pitch bell locations j in the time domain of the signal to be synthesized. In order to determine the required pitch bell locations j, the input parameter ‘length y’ is utilized. The input parameter length y specifies the multiplication factor for the duration of the original signal. Further, it is possible to provide a dynamically varying pitch as an additional input parameter to modify the fundamental frequency in addition to or instead of the duration.

Module 510 serves to select pitch bells from the set of pitch bells obtained from the original sound signal. Module 510 is coupled to pseudo random number generator 512. For each of the required pitch bell locations in the domain of the signal to be synthesized, a pseudo random number is generated by pseudo random number generator 512. By means of these random numbers, selections of pitch bells from the set of pitch bells are made by module 510 in order to provide a randomly selected pitch bell for each of the required pitch bell locations in the time domain of the signal to be synthesized. Module 514 serves to perform an overlap and add operation on the selected pitch bells in the time domain of the signal to be synthesized. This way the synthesized signal having the required duration is obtained.

It is to be noted that the present invention can be applied on steady regions. For example, such a steady region can be a vowel or a noisy voiced sound like /z/. Hence, the invention is not restricted to ‘hybrid’ sounds.

Furthermore, it is to be noted that the synthesized signal does not need to have the same pitch (fundamental frequency) as the original. In some applications it is required to change the pitch, for example in order to synthesize singing. In order to accomplish this change of fundamental frequency in the synthesized signal, the period locations in the synthesized signal are placed closer together or further apart than in the original. This does not otherwise change the synthesis procedure.
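The effect of changing the spacing of the period locations can be sketched by counting the required pitch bell locations for a given output duration. The helper `pitch_bell_locations` and the sample values T = 60, p = 10, q = 8 and y = 4 are assumptions chosen only for illustration, with all quantities expressed in samples.

```python
def pitch_bell_locations(duration: int, period: int) -> list[int]:
    """Sample positions of the required pitch bells, one per period."""
    return list(range(0, duration, period))


T, p, y = 60, 10, 4      # original duration, original period, stretch factor
q = 8                    # shorter period, i.e. a higher fundamental frequency

same_pitch = pitch_bell_locations(y * T, p)     # spaced by period p
higher_pitch = pitch_bell_locations(y * T, q)   # spaced by period q < p
```

A shorter period q yields more locations within the same duration yT, so more pitch bells are drawn, while the selection and the overlap and add operation stay the same.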

Further, it is to be noted that the present invention is not restricted to a certain choice of a window. Instead of raised cosine or sine windows, other windows can be used, such as triangular windows.

1. A method of synthesizing a first sound signal based on a second sound signal, the first sound signal having a required first fundamental frequency and the second sound signal having a second fundamental frequency, the method comprising the steps of: determining required pitch bell locations in the time domain of the first sound signal, the pitch bell locations being distanced by one period of the first fundamental frequency, providing a plurality of pitch bells by windowing the second sound signal based on pitch bell locations in the time domain of the second sound signal, the pitch bell locations of the second sound signal being distanced by one period of the second fundamental frequency, said windowing being determined based on a type of said second sound signal; randomly selecting one of said pitch bells from the provided pitch bells for each of the required pitch bell locations, said selection being uniformly distributed among said number of provided pitch bells; and performing an overlap and add operation on the selected pitch bells for synthesizing the first signal.
2. The method of claim 1, wherein the second sound signal is a hybrid sound comprising a noisy and periodic component.
3. The method of claim 1, wherein the second sound signal comprises a voiced fricative sound signal.
4. The method of claim 1, wherein the second sound signal comprises a voiced sound signal and wherein a raised cosine is used for windowing of the second sound signal.
5. The method of claim 1, wherein the second sound signal comprises an unvoiced sound signal and wherein a sine window is used for windowing of the second sound signal.
6. The method of claim 1, wherein the second sound signal has spectrally alike periods, the spectrally alike periods having basically the same information content.
7. The method of claim 1, wherein the required first fundamental frequency and the second fundamental frequency are substantially the same.
8. A computer system, in particular text-to-speech synthesis system, for synthesizing a first sound signal based on a second sound signal, the first sound signal having a required first fundamental frequency and the second sound signal having a second fundamental frequency, the computer system comprising: means for determining required pitch bell locations in the time domain of the first sound signal, the pitch bell locations being distanced by one period of the first fundamental frequency, means for providing a plurality of pitch bells by windowing the second sound signal based on pitch bell locations in the time domain of the second sound signal, the pitch bell locations of the second sound signal being distanced by one period of the second fundamental frequency, said windowing being determined based on a type of said second signal, means for randomly selecting one of said pitch bells from the provided pitch bells for each of the required pitch bell locations, said selection being uniformly distributed among said number of provided pitch bells; and means for performing an overlap and add operation on the selected pitch bells for synthesizing the first signal.
9. The computer system of claim 8 further comprising: means for storing of sound classification data, the means for storing of sound classification data being adapted to store data being indicative of an interval containing the second sound signal within an original sound signal.
10. A method for constructing a synthesized signal comprising: determining a plurality of pitch bell locations within an original sound signal, said locations being distanced by one period of a fundamental frequency; determining a plurality of pitch bells associated with each of said pitch bell locations, said pitch bells being determined by windowing said original sound signal, said windowing being determined based on a type of said original signal; determining a plurality of pitch bell locations within a signal to be synthesized, said locations being distanced by one period of a frequency associated with said synthesized signal; randomly selecting for each of a plurality of pitch bell locations within said synthesized signal one of said pitch bells associated with said original signal; and overlapping and adding said selected pitch bells at said synthesized signal pitch bell locations.
11. A device for synthesizing a first sound signal based on a second sound signal, the device comprising: a first module configured to determine required pitch bell locations of the first sound signal; a windowing module configured to provide a plurality of pitch bells by windowing the second sound signal based on pitch bell locations of the second sound signal, said windowing being determined based on a type of said second signal, a selector configured to randomly select one of said pitch bells from the provided pitch bells for each of the required pitch bell locations, said selection being uniformly distributed among said number of provided pitch bells; and an adder configured to overlap and add the selected pitch bells for synthesizing the first signal.
12. The device of claim 11, wherein the pitch bell locations of the first sound signal are distanced by one period of a first fundamental frequency of the first sound signal, and the pitch bell locations of the second sound signal are distanced by one period of a second fundamental frequency of the second sound signal.
13. The device of claim 11, wherein the required pitch bell locations are in a time domain of the first sound signal.
14. The device of claim 11, wherein the windowing is based on the pitch bell locations in a time domain of the second sound signal.
15. The device of claim 11, further comprising a module configured for storing of sound classification data indicative of an interval containing the second sound signal within an original sound signal.